Ziyuan Huang
Last Updated: November 13, 2025
ANLY 500 will focus on the foundations of:
The first question we must ask ourselves in this course is: What is Analytics?
We should note that analytics can be defined in two ways!
The utilization of:
The focus of data analytics can be defined under three scopes, including:
Dataset: “Sunspot Trends from 1749-01-01 to 2013-09-01”
Description: Understand the Historical Trend of Sunspots from 1749 to 2013.
sunspot.month  # printed as a year-by-month table (output truncated)
##       Jan   Feb   Mar  Apr   May   Jun   Jul   Aug   Sep   Oct
## 1749 96.7 104.3 116.7 92.8 141.7 139.2 158.0 110.5 126.5 125.8
str(sunspot.month)
## Time-Series [1:3310] from 1749 to 2025: 96.7 104.3 116.7 92.8 141.7 ...
summary(sunspot.month)
##  Min. 1st Qu. Median  Mean 3rd Qu.   Max.
##  0.00   24.20  68.00 81.99  122.70 398.20
library(ggplot2)
library(dplyr)
# Convert to tibble with time index
sunspot_df <- tibble(
  time = seq_along(sunspot.month),
  sunspots = as.numeric(sunspot.month)
)

ggplot(sunspot_df, aes(x = time, y = sunspots)) +
  geom_point(alpha = 0.5, color = "steelblue") +
  labs(
    y = "Number of Sunspots",
    x = "Time (Months since 1749)",
    title = "Historical Sunspot Activity"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

library(quantmod)
library(lubridate)
# Define date range (last 5 years)
start_date <- Sys.Date() - years(5)
end_date <- Sys.Date() - days(2)
# Fetch stock data with error handling
suppressWarnings(
  getSymbols("AMZN", src = "yahoo", from = start_date, to = end_date, auto.assign = TRUE)
)
## [1] "AMZN"
## An xts object on 2020-11-13 / 2025-11-10 containing:
## Data: double [1253, 6]
## Columns: AMZN.Open, AMZN.High, AMZN.Low, AMZN.Close, AMZN.Volume ... with 1 more column
## Index: Date [1253] (TZ: "UTC")
## xts Attributes:
## $ src : chr "yahoo"
## $ updated: POSIXct[1:1], format: "2025-11-13 17:47:41"
# Split data for training (first 1199 observations)
train_size <- min(1199, nrow(AMZN))
train_data <- AMZN[1:train_size, ]
# Build predictive model
predictive_model <- lm(
  formula = AMZN.Close ~ AMZN.High + AMZN.Low + AMZN.Volume,
  data = train_data
)
summary(predictive_model)
##
## Call:
## lm(formula = AMZN.Close ~ AMZN.High + AMZN.Low + AMZN.Volume,
## data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9513 -0.8685 -0.0221 0.9510 9.3134
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.445e-02 2.484e-01 -0.058 0.954
## AMZN.High 5.268e-01 2.427e-02 21.701 <2e-16 ***
## AMZN.Low 4.732e-01 2.473e-02 19.137 <2e-16 ***
## AMZN.Volume -8.230e-10 1.845e-09 -0.446 0.656
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.415 on 1195 degrees of freedom
## Multiple R-squared: 0.9985, Adjusted R-squared: 0.9985
## F-statistic: 2.704e+05 on 3 and 1195 DF, p-value: < 2.2e-16
# Diagnostic plots for model assessment
par(mfrow = c(2, 3), mar = c(4, 4, 2, 1))
plot(predictive_model, which = 1, main = "Residuals vs Fitted")
plot(predictive_model, which = 2, main = "Q-Q Plot")
plot(predictive_model, which = 3, main = "Scale-Location")
plot(predictive_model, which = 4, main = "Cook's Distance")
plot(predictive_model, which = 5, main = "Residuals vs Leverage")
par(mfrow = c(1, 1))  # Reset layout

# Make predictions on test set
n <- nrow(AMZN)
test_start <- min(1200, n)
test_data <- AMZN[test_start:n, ]
prediction <- predict(predictive_model, newdata = test_data)
# Display last predictions
tail(data.frame(
  predicted_close = prediction,
  actual_close = test_data$AMZN.Close
))

# Visualize predictions
plot(prediction,
     type = "l",
     col = "steelblue",
     lwd = 2,
     main = "Predicted Amazon Closing Prices",
     xlab = "Time Index",
     ylab = "Predicted Close Price ($)",
     las = 1)
grid(col = "gray90")

Analytics is the discovery, interpretation, and communication of meaningful patterns in data.
Now we should be asking the question: What is Data Analytics?
High level analysis techniques commonly used in data analytics include:
However, two other types of analysis may be considered.
Quantitative data analysis: the analysis of numerical data with quantifiable variables that can be compared or measured statistically.
Qualitative data analysis: more interpretive; it focuses on understanding the content of non-numerical data such as text, images, audio, and video, including common phrases, themes, and points of view.
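A toy contrast between the two approaches (all data below is invented for illustration):

```r
# Quantitative: numeric variables summarized statistically
heights <- c(170, 165, 180, 175, 168)
mean(heights)  # 171.6
sd(heights)    # spread of the measurements

# Qualitative: non-numerical data, e.g. counting common words in text responses
responses <- c("great course", "great pace", "hard exams")
words <- unlist(strsplit(responses, " "))
sort(table(words), decreasing = TRUE)  # "great" appears twice, the rest once
```

In practice qualitative analysis goes far beyond word counts (coding themes, interpreting context), but the example shows the basic numeric vs. non-numeric divide.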
In other words, formulate a question that needs to be answered.
Test the concept:
Theory:
Hypothesis:
Falsification:
Independent Variable:
Dependent Variable:
Data is a set of values/measurements of quantitative or qualitative variables.
In a dataset, we can distinguish two types of variables:
Definition - entities that are divided into distinct categories.
Includes the following:
R stores categorical variables as a factor or character.
Factors are variables in R that take on a limited number of distinct values (levels).
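A minimal sketch of the character vs. factor distinction, using an invented eye-color vector:

```r
# Categorical data can be stored as character or as factor
eye_color <- c("brown", "blue", "brown", "green")
class(eye_color)    # "character"

eye_factor <- factor(eye_color)
class(eye_factor)   # "factor"
levels(eye_factor)  # distinct categories: "blue" "brown" "green"
table(eye_factor)   # counts per category
```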
Definition - a binary variable has only two categories.
Definition - a nominal variable has more than two categories.
Definition - an ordinal variable is the same as a nominal variable, except its categories have a logical order.
In addition to being able to classify values into categories, you can order the categories: first, second, third
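The three types can be sketched with invented data; note that `ordered = TRUE` is what tells R the categories have a logical order:

```r
# Binary: exactly two categories
passed <- factor(c("yes", "no", "yes"))

# Nominal: more than two categories, no inherent order
major <- factor(c("Math", "Biology", "History"))

# Ordinal: categories with a logical order
grade <- factor(c("B", "A", "C"),
                levels = c("C", "B", "A"),
                ordered = TRUE)
grade[1] < grade[2]  # TRUE: "B" < "A" is a meaningful comparison
```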
Definition - entities get a distinct score.
Includes the following:
Definition - An interval variable has equal intervals along the scale: equal differences in the variable represent equal differences in the property being measured. It does not have a true zero.
Definition - A ratio variable is the same as an interval variable, except that ratios of scores on the scale must also make sense. It does have a true zero.
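A common illustration (temperature, an example not from the lecture): Celsius is an interval scale, while Kelvin is a ratio scale:

```r
# Celsius is interval: equal steps, but 0 C is not "no heat",
# so 20 C is not "twice as hot" as 10 C
celsius <- c(10, 20)
celsius[2] / celsius[1]  # 2, but this ratio has no physical meaning

# Kelvin has a true zero, so ratios are meaningful
kelvin <- celsius + 273.15
kelvin[2] / kelvin[1]    # about 1.035 - the physically meaningful ratio
```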
The accuracy of the measurements is key to your solutions.
Measurement Error (aka observational error):
Definition - the discrepancy between the actual value we’re trying to measure and the number we use to represent that value.
Validity:
Including the following:
Reliability:
Test-Retest Reliability:
To use and test measures in research, we must now understand the following: How to measure?
It is different for certain types of research, including:
Definition - one or more variables are systematically manipulated to see their effect (alone or in combination) on an outcome variable.
Cause and Effect (Hume, 1748)
Confounding variables: the ‘Tertium Quid’
Ruling out confounds (Mill, 1865)
Having considered what and how to measure, we now look at the methods of data collection.
For instance:
Between-group/between-subject/independent
Repeated-measures (within-subject)
Systematic Variation
Unsystematic Variation
Randomization
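Random assignment to groups can be sketched with `sample()`; the subject labels below are hypothetical:

```r
# Randomly assign 10 hypothetical subjects to two groups of 5
set.seed(1)
subjects <- paste0("S", 1:10)
assignment <- sample(rep(c("treatment", "control"), each = 5))

data.frame(subject = subjects, group = assignment)
table(assignment)  # exactly 5 per group
```

Shuffling a balanced vector (rather than drawing each label independently) guarantees equal group sizes while keeping the assignment random.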
First, you should understand populations and samples so that you do not mislead yourself when interpreting results.
Population
Sample
A simple statistical model can be used to analyze data.
For instance, the mean is a hypothetical value: a simple model summarizing the data that need not equal any observed value.
library(dplyr)
# Calculate mean sepal length by species
iris %>%
  group_by(Species) %>%
  summarise(mean_sepal_length = mean(Sepal.Length)) %>%
  knitr::kable(digits = 2, caption = "Mean Sepal Length by Species")

| Species | mean_sepal_length |
|---|---|
| setosa | 5.01 |
| versicolor | 5.94 |
| virginica | 6.59 |
The numbers estimated from a single test/study/experiment come from a sample, so they are sample statistics rather than population parameters.
Parameters (population values) = Greek symbols
Statistics (sample estimates) = Latin letters
library(dplyr)
# Set seed for reproducibility
set.seed(123)
# Random sample of 15 observations
iris_sample <- iris %>%
  slice_sample(n = 15)

# Compare sample vs population means
bind_rows(
  iris_sample %>%
    group_by(Species) %>%
    summarise(mean_sepal_length = mean(Sepal.Length)) %>%
    mutate(type = "Sample (n=15)"),
  iris %>%
    group_by(Species) %>%
    summarise(mean_sepal_length = mean(Sepal.Length)) %>%
    mutate(type = "Population (n=150)")
) %>%
  select(type, Species, mean_sepal_length) %>%
  knitr::kable(digits = 3, caption = "Sample vs Population Comparison")

| type | Species | mean_sepal_length |
|---|---|---|
| Sample (n=15) | setosa | 4.660 |
| Sample (n=15) | versicolor | 5.660 |
| Sample (n=15) | virginica | 6.440 |
| Population (n=150) | setosa | 5.006 |
| Population (n=150) | versicolor | 5.936 |
| Population (n=150) | virginica | 6.588 |
To analyze the data and generate interpretable results the following statistical models can be used:
In this lecture, you have learned: